We're going to use GitHubArchive to retrieve a large amount of data from the GitHub activity stream. GitHubArchive provides hourly files whose names follow the pattern http://data.githubarchive.org/{year}-{month}-{day}-{hour}.json.gz. These files are available since 2011-12-02, but the file format changed at the beginning of 2015.
We first generate the list of links we are interested in, say from 2013-01-01 to 2013-01-05. Note that rrule's until bound is inclusive, so the first hour of 2013-01-05 is also listed.
In [ ]:
from dateutil import rrule
from datetime import date
start_date = date(2013, 1, 1)
end_date = date(2013, 1, 5)
date_list = rrule.rrule(rrule.HOURLY, dtstart=start_date, until=end_date)
link_format = 'http://data.githubarchive.org/{year}-{month:0>2}-{day:0>2}-{hour}.json.gz'
links = [link_format.format(year=d.year, month=d.month, day=d.day, hour=d.hour) for d in date_list]
if __name__ == '__main__':
    print '\n'.join(links)
The easiest way to retrieve several files is to use wget (or the requests module in Python). Assuming you stored the list of links in a file links.txt:
wget -i links.txt -nc -c
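As an alternative to wget, here is a minimal sketch using the requests module. The helpers filename_for and download are ours, not part of any library; download mimics wget's -nc (no clobber) behaviour by skipping files already present:

```python
import os
import requests

def filename_for(link):
    # Local file name, e.g. '2013-01-01-0.json.gz'
    return link.rsplit('/', 1)[1]

def download(links, target_dir='.'):
    """Fetch each gzipped archive, skipping files already on disk."""
    for link in links:
        path = os.path.join(target_dir, filename_for(link))
        if os.path.exists(path):
            continue  # same behaviour as wget -nc
        response = requests.get(link, stream=True)
        response.raise_for_status()
        with open(path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1 << 16):
                f.write(chunk)
```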
Those files are gzipped text files containing one JSON document per line. Their content can be retrieved with the following function:
In [ ]:
import json
import gzip
def get_content_from_file(filepath):
    """
    Return a list of the JSON structures contained in the file
    located at filepath. This function expects the file to be
    gzipped, with one JSON document per line.
    """
    content = []
    with gzip.GzipFile(filepath) as f:
        for line in f:
            try:
                content.append(json.loads(line))
            except ValueError:
                # Skip malformed lines (invalid JSON, encoding errors, ...)
                pass
    return content
Say we are interested to get the content of every file that we downloaded:
In [ ]:
import os
# Assuming we are inside the right directory and there are no other files in it.
filename_list = os.listdir('.')
# 'activity' will store the entire activity stream
activity = []
for filename in filename_list:
    activity += get_content_from_file(filename)
print activity[0]
This content can then be put in a (relational or not) database. An event looks like:
{u'actor': u'lastr2d2',
u'actor_attributes': {u'email': u'lastr2d2@gmail.com',
u'gravatar_id': u'2a8a2ef556894cb1b6945a8c471bc4e9',
u'login': u'lastr2d2',
u'name': u'Wayne Wang',
u'type': u'User'},
u'created_at': u'2014-01-01T01:01:58-08:00',
u'payload': {u'head': u'afa9b3ac304d6ab92fd7689d1604f240b8f4ae38',
u'ref': u'refs/heads/master',
u'shas': [[u'afa9b3ac304d6ab92fd7689d1604f240b8f4ae38',
u'lastr2d2@gmail.com',
u'updated minifized version',
u'Wayne Wang',
True]],
u'size': 1},
u'public': True,
u'repository': {u'created_at': u'2013-11-19T00:01:51-08:00',
u'description': u'My userscript for douban.fm',
u'fork': False,
u'forks': 0,
u'has_downloads': True,
u'has_issues': True,
u'has_wiki': True,
u'id': 14517966,
u'language': u'JavaScript',
u'master_branch': u'master',
u'name': u'scripts-doubanfm',
u'open_issues': 0,
u'owner': u'lastr2d2',
u'private': False,
u'pushed_at': u'2014-01-01T01:01:57-08:00',
u'size': 128,
u'stargazers': 0,
u'url': u'https://github.com/lastr2d2/scripts-doubanfm',
u'watchers': 0},
u'type': u'PushEvent',
u'url': u'https://github.com/lastr2d2/scripts-doubanfm/compare/ba4d721b3d...afa9b3ac30'}
Notice that payload does not have a fixed schema: its structure depends on the event type.
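For instance, a PushEvent payload carries commit information while a WatchEvent payload only carries an action. One quick way to see this is to group payload keys by event type; a sketch on toy events (the keys shown are examples, not an exhaustive schema):

```python
# Toy events; real payload keys vary with the event type.
events = [
    {'type': 'PushEvent',
     'payload': {'head': 'afa9b3a', 'ref': 'refs/heads/master', 'size': 1}},
    {'type': 'WatchEvent',
     'payload': {'action': 'started'}},
]

# Map each event type to the set of payload keys seen in the stream.
keys_by_type = {}
for event in events:
    keys_by_type.setdefault(event['type'], set()).update(event['payload'])
```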
We then filtered the events (and by extension their repositories) to keep those related to the R language. This is easily done by filtering on the repository.language key. With such a list of R repositories, we are interested in identifying which ones are R packages. To do this, we kept the repositories that contain a DESCRIPTION file at their root. If repository is the (full) name of the repository, then this file can possibly be retrieved from https://raw.githubusercontent.com/_repository_/master/DESCRIPTION
In [ ]:
import requests
url = 'https://raw.githubusercontent.com/{repository}/master/DESCRIPTION'
def get_description_file_for(repository):
    """
    Given the full name of a GitHub repository, return its DESCRIPTION
    file if it exists, or None otherwise.
    """
    result = requests.get(url.format(repository=repository))
    if result.status_code == 200:
        return result.content
    else:
        return None
We put all the data collected from 2013 and 2014 in a MongoDB datastore. Our MongoDB contains a collection events with every event from GitHub Archive related to R. We spread the data into several collections: events contains the raw event, repository contains event.repository, payload contains event.payload, etc., for every subdocument contained in each event.
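A minimal sketch of this splitting (split_event is our own helper, not the actual ingestion code; it promotes every dict-valued field of the event to its own collection):

```python
def split_event(event):
    """Return (collection_name, document) pairs for one raw event."""
    docs = [('events', event)]
    for key, value in event.items():
        if isinstance(value, dict):  # repository, payload, actor_attributes, ...
            docs.append((key, value))
    return docs

event = {'type': 'PushEvent',
         'repository': {'name': 'scripts-doubanfm'},
         'payload': {'size': 1}}
# With pymongo, one would then do:
#   for name, doc in split_event(event): db[name].insert(doc)
collections = dict(split_event(event))
```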
Here's the result of some queries, FYI:
> db.events.count()
1016423
> db.events.distinct('repository.url').length
121385
> db.events.distinct('repository.id').length
118675
> db.events.distinct('repository.name').length
43164
Notice that, at the same time, https://github.com/search?utf8=%E2%9C%93&q=language%3AR&type=Repositories&ref=searchresults shows 67275 repositories. This can be explained by the fact that a large majority of the 121385 repositories we collected have been deleted since the events were recorded.
Moreover, we added a collection descriptionfile. This collection contains documents of the form {_id: URL, file: CONTENT}, where URL is the URL of an R repository and CONTENT is the content of its DESCRIPTION file, if any.
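A sketch of how such a document can be built (description_document is a hypothetical helper of ours): the file key is only set when a DESCRIPTION file was actually found, which is what the $exists queries rely on.

```python
def description_document(repo_url, description):
    """Build the document stored in 'descriptionfile' for one repository."""
    doc = {'_id': repo_url}
    if description is not None:  # only packages carry a 'file' key
        doc['file'] = description
    return doc

# Example inputs; the DESCRIPTION content would come from
# get_description_file_for(repository) defined above.
with_file = description_document('https://github.com/cran/knitr', 'Package: knitr')
without_file = description_document('https://github.com/someone/not-a-package', None)
```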
> db.descriptionfile.find({file: {$exists: true}}).length()
19052
That is, we collected and identified 19052 repositories that are R packages. This includes redundant repositories, like rpkg/*, which is an alias for cran/*.
> db.descriptionfile.find({_id: { $regex: /^https:\/\/github.com\/cran/ } } ).length()
6007
> db.descriptionfile.find({_id: { $regex: /^https:\/\/github.com\/rpkg/ } } ).length()
4423
Say we are interested in identifying which are the R packages that are hosted by CRAN on Github:
In [ ]:
import pymongo
# Assuming we are locally running a MongoDB instance
db = pymongo.MongoClient().r
# Contains 3-uples ('https://github.com', OWNER_NAME, REPOSITORY_NAME)
packages = [doc['_id'].rsplit('/', 2)
            for doc in db.descriptionfile.find({'file': {'$exists': True}}, fields=['_id'])]
# Filter CRAN, RPKG and the others
cran_names = [p[2] for p in packages if p[1] == 'cran']
rpkg_names = [p[2] for p in packages if p[1] == 'rpkg']
other_names = [p[2] for p in packages if p[1] not in ('cran', 'rpkg')]
# len(other_names) == 8643
cran_set = set(cran_names)
rpkg_set = set(rpkg_names)
other_set = set(other_names)
# Names that are NOT in cran_set and rpkg_set: 5068 items
outside_only = other_set.difference(cran_set).difference(rpkg_set)
# Names that are in cran_set: 1210 items
cran_too = other_set.intersection(cran_set)
# Names that are in rpkg_set: 753 items
rpkg_too = other_set.intersection(rpkg_set)
# Names from cran that are in rpkg too: 4345 items (rpkg has 4410 items!)
rpkg_cran = rpkg_set.intersection(cran_set)